Development and Internal Validation of Risk Assessment Models for Chronic Obstructive Pulmonary Disease in Coal Workers

Coal workers are more likely to develop chronic obstructive pulmonary disease because of exposure to occupational hazards such as dust. This study develops risk assessment models and, from the optimal model, constructs a risk scoring system to provide feasible suggestions for the prevention of chronic obstructive pulmonary disease in coal workers. The study subjects were 3955 coal workers who participated in occupational health check-ups at Gequan mine and Dongpang mine of Hebei Jizhong Energy from July 2018 to August 2018. Random forest, logistic regression, and convolutional neural network (CNN) models were established, model performance was evaluated to select the optimal model, and a risk scoring system was then constructed from the optimal model to achieve model visualization. In the training set, the logistic, random forest, and CNN models have sensitivities of 78.55%, 86.89%, and 77.18%; specificities of 85.23%, 92.32%, and 87.61%; accuracies of 81.21%, 85.40%, and 83.02%; Brier scores of 0.14, 0.10, and 0.14; and AUCs of 0.76, 0.88, and 0.78, respectively. Similar results are obtained for the test set and validation set, with the random forest model outperforming the other two models. The risk scoring system constructed according to the importance ranking of the random forest predictor variables has an AUC of 0.842; on evaluation, its accuracy is 83.7% and its AUC is 0.827, indicating good discriminatory ability. In conclusion, the random forest model outperforms the CNN and logistic regression models, and the chronic obstructive pulmonary disease risk scoring system constructed from it has good discriminatory power.


Introduction
Chronic obstructive pulmonary disease (COPD) is a common, preventable respiratory disease characterized by persistent airflow limitation, which is associated with an enhanced chronic inflammatory response of the airways and lungs to toxic particles or gases. COPD has a high prevalence and mortality and is the third leading cause of death worldwide; the global prevalence of COPD in 2019 was 13.1%, with prevalence rates ranging from 11.6% to 13.9% in different regions of the world [1]. COPD not only affects lung function but also has extrapulmonary effects on the whole body, with common comorbidities including cardiovascular disease, lung cancer, osteoporosis, anxiety, and depression [2]. COPD seriously harms individual health, and at present there is no effective way to slow the progression of the disease. Once the condition of a COPD patient deteriorates, lung function declines and the risks of death and disability increase [3]. Smoking, air pollution, biomass fuels, and occupational dust exposure are considered important risk factors for COPD. Because of their particular working environment, coal workers are often exposed to dust, chemical substances, and other

Research Object
This study relies on China's key Research and Development program "Cohort Study on Health Effects of Occupational Groups in Beijing-Tianjin-Hebei Region", and the study subjects are 3955 coal workers who participated in occupational health examinations at Gequan Mine and Dongpang Mine in Hebei province from July 2018 to August 2018.
Inclusion criteria: 18~60 years old, ≥1 year of service. Exclusion criteria: those who could not undergo lung function measurement, i.e., those who had undergone chest, abdominal, or eye surgery in the past 3 months, those who were pregnant or breastfeeding, and those who had been hospitalized for heart disease in the past 1 month; and those with missing questionnaire information.
The study was conducted in accordance with the Declaration of Helsinki, verified and approved by the Ethics Committee of the North China University of Technology (15006), and all study subjects voluntarily participated in this investigation and signed an informed consent form.

Data Collection
Personal information was obtained through questionnaires, which were administered to workers by professional staff in a one-to-one manner. The questionnaire mainly included the following sections: (1) demographic information: age, gender, ethnicity, marital status, education level, economic income, etc.; (2) behavioral lifestyle: smoking, drinking status, dietary conditions, physical activity, sleep quality; (3) personal history of diseases: hypertension, diabetes, tumors; (4) work status: nature of employment, length of service, type of work, shift situation.

Physical Examination
(1) Height and weight: measurements were obtained with the Dekang DK-08-C height and weight meter, for which the subjects should remove shoes, hats, watches, and other items that affect the test results and the measurements should be obtained in the correct position according to the instructions of the relevant personnel. (2) Pulmonary function test: pulmonary function measurements were obtained as instructed by the staff, where the subject should sit quietly, sit with the upper body straight, keep the head horizontal, clip on the nose clip, and put on the mouthpiece according to the instructions of the professional staff before the test, while ensuring that the tongue cannot block the mouthpiece or leak air.

Definition of Outcome
The pulmonary function test was performed by professionals using a portable spirometer (China CHEST), mainly to measure the forced expiratory volume in the first second (FEV1) and the forced vital capacity (FVC). According to the 2017 Global Initiative for Chronic Obstructive Lung Disease (GOLD) guidelines [9], FEV1/FVC < 70% is diagnosed as COPD.

Drinking Status
In this study, drinking status was categorized as never drinking, former drinking (now abstaining), and current drinking.

Physical Exercise
Physical exercise was defined as exercising more than 3 times a week for more than half an hour each time.

Physical Activity
In this study, the International Physical Activity Questionnaire (IPAQ) was used to investigate the physical activity of coal workers [10]. Physical activity was classified as "low", "medium", or "high" according to intensity, frequency, and overall weekly physical activity level. An overall weekly physical activity level < 600 MET-min/week is considered low, 600 to < 3000 MET-min/week is considered medium, and ≥ 3000 MET-min/week is considered high.

Sleep Quality
The Athens Insomnia Scale (AIS) was applied to assess the sleep quality of coal workers [11], with scores <4 indicating no sleep disorder, scores of 4-6 indicating suspected insomnia, and scores >6 indicating insomnia.

Cumulative Dust Exposure (CDE)
The criteria for determining dust exposure in this study are based on the "Determination of Dust in Workplace Air Part 1: Total Dust Concentration", and the cumulative individual dust exposure is calculated based on the total dust concentration in the workplace measured by a qualified testing company and the actual results of daily testing [12].
$$\mathrm{CDE} = \sum_{n} C_n \times T_n$$
where $C_n$ is the annual geometric mean dust concentration (mg/m³) for the n-th job performed by a coal worker, and $T_n$ is the duration of dust exposure (years) in that job. CDE is therefore expressed in mg/m³·year, and the specific grouping is as follows: <50, 50~, and 100~.
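The calculation can be illustrated with a minimal sketch, assuming that cumulative dust exposure is the sum of C_n × T_n over the jobs a worker has held. The function names and the example job history are hypothetical and are not taken from the study data.

```python
from typing import Iterable, Tuple


def cumulative_dust_exposure(jobs: Iterable[Tuple[float, float]]) -> float:
    """Sum C_n * T_n over jobs; each job is (concentration in mg/m^3, years)."""
    return sum(concentration * years for concentration, years in jobs)


def cde_group(cde: float) -> str:
    """Map a CDE value (mg/m^3 x year) to the three groups used in the text."""
    if cde < 50:
        return "<50"
    if cde < 100:
        return "50~"
    return "100~"


# Hypothetical worker: 10 years at 4.0 mg/m^3, then 8 years at 2.5 mg/m^3.
history = [(4.0, 10.0), (2.5, 8.0)]
total = cumulative_dust_exposure(history)  # 40.0 + 20.0
group = cde_group(total)
print(total, group)
```

Keeping the grouping in a separate function makes it easy to change the cut-offs without touching the exposure calculation.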

Shift Situation
Shift work refers to a system of working hours in which the production process requires 24 h of continuous operation, maintained by one or several teams working in rotation. This study classifies shift work into the following three categories: never worked shifts, formerly worked shifts, and currently working shifts [13].

Ventilation and Dust Removal Measures
The evaluation of ventilation and dust removal measures combined the evaluation results of the inspection company with the coal workers' evaluation of how the facilities operate in daily work. The specific grouping is as follows: poor, ordinary, and good.

Statistical Methods
Count data were expressed as rates, and the chi-square test was used for comparison between groups; unconditional logistic regression was used for multivariate analysis. Candidate factors were identified through an extensive literature review and collection of relevant data; univariate analysis of these factors was carried out, the variables that were significant in the univariate analysis were then entered into the multivariate analysis, and the influencing factors of COPD in coal workers were finally determined. All statistical tests were two-sided, with a test level of α = 0.05. All analyses were carried out in SPSS 22.0 statistical software (IBM, Armonk, NY, USA).

Model Establishment
In this study, sklearn.model_selection.train_test_split was used to divide the dataset into training set, test set, and validation set according to 7:2:1 (Supplementary Material S1). The screening of model predictors was carried out through univariate analysis, multivariate analysis, and literature review to construct a risk assessment model.
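Because train_test_split produces only two parts per call, a 7:2:1 division needs two successive calls. The sketch below shows one way to do this; the helper name split_7_2_1, the synthetic data, the fixed seed, and the use of stratification are assumptions for illustration, and the authors' actual procedure is the one given in Supplementary Material S1.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_7_2_1(X, y, seed=0):
    """Split into training, test, and validation parts in a 7:2:1 ratio."""
    n = len(y)
    n_train = round(0.7 * n)
    n_test = round(0.2 * n)
    # First call: carve out the training set (70%).
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=n_train, stratify=y, random_state=seed
    )
    # Second call: divide the remaining 30% into test (20%) and validation (10%).
    X_test, X_val, y_test, y_val = train_test_split(
        X_rest, y_rest, train_size=n_test, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_test, y_test), (X_val, y_val)


# Synthetic stand-in data: 1000 subjects, 11 predictors, roughly 23% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 11))
y = (rng.random(1000) < 0.23).astype(int)

train, test, val = split_7_2_1(X, y)
print(len(train[1]), len(test[1]), len(val[1]))
```

Passing integer sizes rather than fractions avoids rounding surprises, and stratifying on the outcome keeps the COPD proportion similar across the three parts.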
The logistic regression model is a classification algorithm that uses a sigmoid function for classification; in this study it is implemented using the LogisticRegression class in sklearn (Supplementary Material S2).
The convolutional neural network model consists of convolutional layers, pooling layers, activation layers, and finally a fully connected layer for the classification output. In this study, the CNN was constructed using keras; the activation function is ReLU, the loss function is binary_crossentropy, and the optimizer is rmsprop (Supplementary Material S3).
The random forest model is essentially a collection of multiple decision trees and is an ensemble learning method. The random forest model is built using the RandomForestClassifier class in sklearn, and the parameters are tuned by means of the learning curve and the randomized parameter search RandomizedSearchCV. The following parameters were adjusted: the number of trees n_estimators, the maximum depth of the tree max_depth, the number of randomly selected features max_features, and the minimum number of samples required to split a node min_samples_split, in order to ensure good learning and generalization ability and to avoid overfitting (Supplementary Material S4).
All models are built in Python 3.10.
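The random forest tuning described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code (which is in Supplementary Material S4): the candidate values, the synthetic data from make_classification, the number of iterations, the 3-fold cross-validation, and the roc_auc scoring are all chosen here for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the 11 predictors used in the study.
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)

# Candidate values for the four parameters named in the text (illustrative).
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, 8, None],
    "max_features": ["sqrt", "log2", None],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of random parameter combinations tried
    scoring="roc_auc",   # select by cross-validated AUC
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
# Impurity-based importances of the refitted best model, one per predictor.
importances = search.best_estimator_.feature_importances_
print(best_params)
```

The feature_importances_ array of the best estimator is the kind of importance ranking on which a scoring system can later be based.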

Model Evaluation
The performance of the model was evaluated in terms of both discrimination and calibration.
Discrimination is a measure of a model's ability to distinguish between patients and non-patients; commonly used metrics include sensitivity, specificity, accuracy, and the ROC curve with its area under the curve (AUC).
Calibration is a measure of the accuracy of a model in assessing the future occurrence of an outcome event for an individual; commonly used measures are the Brier score and the calibration curve. The calibration curve is an important method for evaluating the calibration of a model because it visualizes the consistency between the predicted probability and the true probability; the closer the curve is to the diagonal line, the better the calibration of the model.
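The discrimination and calibration measures named above can be computed directly from their definitions, as in the following sketch. The 0.5 probability threshold and the five example predictions are assumptions for illustration only.

```python
def discrimination_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, and accuracy at a given probability threshold."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical outcomes (1 = COPD) and predicted probabilities.
y_true = [1, 1, 0, 0, 0]
y_prob = [0.9, 0.4, 0.2, 0.6, 0.1]

metrics = discrimination_metrics(y_true, y_prob)
brier = brier_score(y_true, y_prob)
print(metrics, brier)
```

A lower Brier score indicates better calibration, whereas higher sensitivity, specificity, and accuracy indicate better discrimination.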

Establishment of a Risk Scoring System
The optimal model was derived from the development and evaluation of a COPD risk assessment model for coal workers, on the basis of which a risk scoring system was established.

Risk Scoring System
A risk scoring system was constructed using an assignment method based on the importance ranking of the predictor variables in the optimal model. Scores were assigned in the following manner.
The hazard score $S_n$ of each independent variable is the relative importance $I_n$ of that variable divided by the smallest relative importance $I_m$, i.e.,
$$S_n = \frac{I_n}{I_m}$$
The total hazard score $S_c$ is the sum of the individual hazard scores, i.e.,
$$S_c = \sum_{n} S_n$$
Combining the results of the single-factor and multi-factor analyses, the highest-risk level of each risk factor is assigned the maximum value and the protective level of each protective factor is assigned the minimum value of 0. The risk scores for each factor are displayed in the results section of the risk scoring system.
(1) When the variable is dichotomous, its categories are assigned 0 and $S_n$;
(2) When the variable is a three-category variable, its categories are assigned 0, $S_n/2$, and $S_n$;
(3) When the variable is a four-category variable, its categories are assigned 0, $S_n/3$, $2S_n/3$, and $S_n$.

Mapping the ROC Curve of a COPD Risk Scoring System for Coal Workers
A risk scoring system was constructed by randomly selecting 70% of the participants, and an ROC curve was drawn according to their scores and whether they have COPD.
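The assignment rule can be sketched in a few lines, assuming that each variable's score is its importance divided by the smallest importance, that category scores are evenly spaced between 0 and that score, and that an individual's total is the sum of the category scores received. The variable names and importance values below are hypothetical and do not reproduce Figure 3 or Table 9.

```python
def variable_scores(importances):
    """S_n = I_n / I_m, where I_m is the smallest relative importance."""
    smallest = min(importances.values())
    return {name: value / smallest for name, value in importances.items()}


def category_scores(s_n, n_categories):
    """Evenly spaced scores from 0 to S_n, e.g. 0, S_n/2, S_n for 3 categories."""
    step = s_n / (n_categories - 1)
    return [k * step for k in range(n_categories)]


def total_score(levels, scores, n_categories):
    """S_c: sum of the category scores for one individual.

    levels maps each variable to a category index, ordered from the
    lowest-risk category (index 0) to the highest-risk category.
    """
    return sum(
        category_scores(scores[name], n_categories[name])[level]
        for name, level in levels.items()
    )


# Hypothetical importances for three predictors.
importances = {"chemical_exposure": 0.5, "dust_exposure": 0.25, "gender": 0.125}
n_categories = {"chemical_exposure": 2, "dust_exposure": 3, "gender": 2}

scores = variable_scores(importances)  # 4.0, 2.0, 1.0
worker = {"chemical_exposure": 1, "dust_exposure": 1, "gender": 0}
s_c = total_score(worker, scores, n_categories)  # 4.0 + 1.0 + 0.0
print(scores, s_c)
```

Dividing by the smallest importance gives the least important variable a maximum score of 1, so every other score can be read as a multiple of it.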

Setting up Hazard Stratification
According to the ROC curve of the COPD risk scoring system, the score M at which the Youden index reaches its maximum was found on the ROC curve, and the study subjects were divided into two levels: low-risk population ($S_c < M$) and high-risk population ($S_c \geq M$).
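The cut-off search can be sketched as follows, assuming the index in question is the Youden index J = sensitivity + specificity - 1 and that a score at or above the cut-off counts as high risk. The six scores and labels are hypothetical.

```python
def youden_cutoff(scores, labels):
    """Return (cut-off, J) maximizing J = sensitivity + specificity - 1.

    A subject is classified as high risk when score >= cut-off.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cut)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


# Hypothetical risk scores and COPD status (1 = COPD).
scores = [5, 10, 15, 20, 25, 30]
labels = [0, 0, 0, 1, 0, 1]

cutoff, j_max = youden_cutoff(scores, labels)
print(cutoff, j_max)
```

Only observed scores are tried as candidate cut-offs, which is sufficient because sensitivity and specificity change only at those values.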

Performance Evaluation of Risk Scoring Systems
1. The remaining 30% of workers, classified according to the above classification criteria, were used to calculate the accuracy rate of the risk scoring system. 2. The area under the ROC curve was used to determine the diagnostic value of the risk scoring system.
An area under the ROC curve ≤ 0.5 indicates that the risk scoring system has no diagnostic value; 0.5~0.7 indicates that it has some diagnostic value; 0.7~0.8 indicates that it has good diagnostic value; and > 0.8 indicates that it has high diagnostic value, with high sensitivity and specificity, so that it can better identify the disease.
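For reference, the area under the ROC curve can be computed without plotting, as the proportion of patient/non-patient pairs in which the patient has the higher score (ties count as one half). The sketch below pairs this with the interpretation bands stated above; the short band labels and the example data are shorthand assumptions.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def diagnostic_value(area):
    """Map an AUC to the bands used in the text (labels are shorthand)."""
    if area <= 0.5:
        return "none"
    if area <= 0.7:
        return "some"
    if area <= 0.8:
        return "good"
    return "high"


# Hypothetical risk scores and COPD status (1 = COPD).
scores = [5, 10, 15, 20, 25, 30]
labels = [0, 0, 0, 1, 0, 1]

area = auc(scores, labels)
print(area, diagnostic_value(area))
```

The pairwise form is quadratic in the sample size, which is acceptable for a few thousand subjects; rank-based formulas give the same value faster.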

Quality Control
Pre-survey training was provided to investigators and information entry for the questionnaire was carried out in pairs to ensure the accuracy of the data. When performing pulmonary function measurement, staff should instruct participants to perform measurements in accordance with standard movements to ensure the quality of pulmonary function test and increase the accuracy and reliability of outcome diagnosis. Factor analysis and review of the literature ensured that factors associated with outcomes were included in the model and that appropriate statistical analysis methods were used.

Analysis of General Demographic Characteristics
This study includes 3955 study participants, of which 918 coal workers have COPD, with a prevalence rate of 23.2%. A univariate analysis of the relationship between general demographic characteristics of coal workers and COPD shows that age, gender, education, household income, and BMI are all associated with COPD, with statistically significant differences (p < 0.05), as detailed in

Analysis of the Health Status of Coal Workers
The univariate analysis of the relationship between the health status of coal workers and COPD shows that a personal history of respiratory diseases is associated with COPD, and the difference is statistically significant (p < 0.05), as detailed in Table 2.

Lifestyle Analysis of Coal Worker Behavior
Through the univariate analysis of the relationship between coal workers' behavior and lifestyle and COPD, the result shows that smoking index, physical exercise, vegetable intake, and fruit intake are all related to COPD, and the differences are statistically significant (p < 0.05), as detailed in Table 3.

Analysis of Occupational Harmful Factors of Coal Workers
A univariate analysis of the relationship between occupational factors and COPD in coal workers shows that seniority, cumulative dust exposure, ventilation and dust removal measures, mask usage, and chemical poison exposure are all associated with COPD, with statistically significant differences (p < 0.05); see Table 4 for details.

Multivariate Analysis of Influencing Factors of COPD among Coal Workers
The factors that were significant in the univariate analysis were used as input variables for an unconditional logistic regression analysis of COPD in coal workers; the variable assignment is shown in Table 5. The multicollinearity diagnosis of the independent variables entered into the multivariate analysis (Table 6) shows that all variance inflation factors (VIF) are greater than 0 and less than 10 and that the tolerance is greater than 0.1 for all variables. The results (Table 7) show that age 30 and above, male gender, history of respiratory diseases, smoking index 1 and above, cumulative dust exposure 50 and above, working experience of 10 years and above, and exposure to chemical poisons are risk factors for COPD in coal workers (all p < 0.05), whereas an education level of bachelor's degree (junior college) or above, physical exercise, mask use from 3-4 days/week up to daily, and generally good ventilation and dust removal measures are protective factors for the occurrence of COPD in coal workers (all p < 0.05).
Table 5 (excerpt). Variable assignment: cumulative dust exposure (groups <50, 50~, and 100~); X15, chemical poison exposure (0 = no, 1 = yes).

Model Results
According to the result of the multi-factor analysis and literature review, a risk assessment model was constructed by including age, gender, education level, personal history of respiratory diseases, smoking index, physical exercise, seniority, mask usage, ventilation and dust removal measures, cumulative dust exposure, and chemical poison exposure.
In the training set (Table 8), the sensitivity, specificity, accuracy, and AUC of random forest are 86.89%, 92.32%, 85.40%, and 0.88, respectively, which are higher than those of the CNN and logistic models. The Brier score and Log loss of random forest are 0.10 and 0.35, respectively, which are lower than those of the CNN and logistic models, and the random forest model has the best performance.
In the test set (Table 8), the sensitivity, specificity, accuracy, and AUC of random forest are 81.86%, 87.06%, 85.10%, and 0.82, respectively, which are higher than those of the CNN and logistic models. The Brier score and Log loss of random forest are 0.13 and 0.41, respectively, which are lower than those of the CNN and logistic models, and the random forest model has the best performance.
In the validation set (Table 8), the sensitivity, specificity, accuracy, and AUC of random forest are 82.93%, 84.30%, 83.11%, and 0.78, respectively, which are higher than those of the CNN and logistic models. The Brier score and Log loss of random forest are 0.11 and 0.37, respectively, which are lower than those of the CNN and logistic models, and the random forest model has the best performance.
The calibration curve of the random forest (Figure 1a-c) is closer to the diagonal line, indicating that the model's predicted value is closer to the true value. The ROC curve (Figure 2a-c) shows that the random forest model outperforms the other two models in all three sets.
In summary, the random forest model outperforms the CNN and logistic models in the risk assessment of COPD in coal workers.
The optimal model is the random forest model and the variables are ranked in importance according to the optimal model. The result is shown in Figure 3, where chemical poison exposure, cumulative dust exposure, mask usage, and smoking index are the important predictor variables for the random forest model.

Risk Scoring System
Based on the model evaluation, the optimal model is the random forest model, on which the risk scoring system is constructed. The risk scoring system was constructed using the assignment method according to the importance of the predictor variables (Figure 3), and the assignment method is shown in Table 9. A risk scoring system was constructed for a random sample of 70% of the study subjects, and ROC curves were plotted according to their scores and whether they have COPD; the results are shown in Figure 4, with an AUC of 0.842. Risk stratification was set as follows: a risk score of 23.05 has the highest Youden index; therefore, a risk score < 23.05 is defined as low risk and a risk score ≥ 23.05 as high risk.
The remaining 30% of the study subjects were used to evaluate the performance of the risk scoring system. The study population was assigned a risk score according to Table 9 and classified according to the classification criteria. The results (Table 10) show that 774 people in the low-risk group are normal and 52 have COPD, and 141 of the high-risk population are normal and 220 have COPD. The accuracy of the risk scoring system is 83.7%, and the AUC of the ROC curve is 0.827 (Figure 5), indicating that the established risk scoring system has good discriminating ability.
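The reported accuracy can be checked from the four counts given above. The sensitivity and specificity below are not reported in the text; they are derived here from the same counts, on the assumption that a high-risk classification is treated as a positive prediction.

```latex
\begin{aligned}
\text{accuracy} &= \frac{774 + 220}{774 + 52 + 141 + 220} = \frac{994}{1187} \approx 0.837 \\
\text{sensitivity} &= \frac{220}{52 + 220} = \frac{220}{272} \approx 0.809 \\
\text{specificity} &= \frac{774}{774 + 141} = \frac{774}{915} \approx 0.846
\end{aligned}
```

The total of 1187 subjects corresponds to 30% of the 3955 participants.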

Discussion
Coal meets 27% of the world's energy needs, supplies 40% of the world's electricity, and is an important pillar of China's industry [14]. A large number of coal workers are exposed to dust, noise, vibration, and high heat, which can lead to occupational diseases such as pneumoconiosis, noise deafness, vibration sickness, and various chronic diseases [15,16]. Our study is dedicated to the physical health of coal workers and we have constructed a risk assessment model and a risk scoring system suitable for COPD in coal workers.
A total of 3955 coal workers were included in the study, with a COPD prevalence rate of 23.2%, which is higher than that of the general population [17]. Older age was a risk factor for COPD in this study, with an OR of 1.770 (1.063-2.948), which is consistent with the results of a related study [18]. This may be related to lung ageing, reduced lung function, and reduced immunity of the lungs to environmental injury [19]. The study found that being male is a risk factor for COPD, with an OR of 3.965 (2.172-7.247). The higher risk of disease in males may be because male coal workers are more likely to smoke, although one study suggests that the risk of COPD in females is increasing [20]. This may be related to women's greater exposure to biomass fuels, higher sensitivity to cigarette smoke, and a faster decline in FEV1 in female smokers [21,22]. This study focuses on coal workers, among whom men far outnumber women, so there may be some bias in the investigation of the effect of gender on COPD. Personal history of respiratory disease, which here mainly refers to a history of tuberculosis and asthma, is a risk factor for COPD in this study. Asthma is an important cause of accelerated FEV1 decline [23]. Tuberculosis is an important cause of airflow obstruction and respiratory symptoms [24,25]. The results of this study, which quantifies smoking in coal workers using a smoking index, suggest that smoking is a risk factor for COPD; smoking has been considered a major risk factor for COPD in many previous studies [26,27]. This may be because cigarette smoke stimulates the release of inflammatory cytokines from respiratory cells, leading to respiratory damage [28,29]. Dust is an important occupational factor for coal workers, and this study quantifies the dust exposure of coal workers by using cumulative dust exposure.
The ORs for cumulative dust exposure of 50~ and 100~ mg/m³·year are 1.382 (1.039-1.837) and 2.228 (1.638-3.029), respectively, so an increase in cumulative dust exposure leads to an increased risk of COPD among coal workers. A possible reason is that coal dust can inactivate α-1 antitrypsin and produce reactive oxygen species: α-1 antitrypsin inactivation increases the risk of COPD, and reactive oxygen species may lead to emphysema in miners [30]. Seniority refers to the number of years of exposure to dust, and in this study 10 years or more of service is associated with an increased risk of COPD among coal workers. Exposure to chemical poisons, which mainly refers to inhalation of irritant gases and fumes, is also an occupational hazard for coal workers. Chemical poison exposure usually activates alveolar macrophages and leukocytes, leading to the release of reactive oxygen species, which causes inflammatory changes in the airways and increases the risk of COPD [31]. Masks and ventilation and dust removal measures are important dust prevention measures for coal workers, and in this study they are protective factors that can reduce the risk of COPD among workers [32]. These protective measures are important in a high-risk environment such as a coal mine to achieve primary prevention of occupational diseases. Physical exercise is a protective factor in this study, and those who carry out physical exercise have a lower risk of COPD, suggesting that increasing physical exercise among coal workers can reduce the decline in FEV1 [33]. Previous studies have found that physical activity is the strongest predictor of all-cause mortality in COPD patients [34]. It is also an important component of pulmonary rehabilitation in COPD patients [35].
The level of education above a bachelor's degree is a protective factor for COPD, which may be associated with good lifestyle habits and minimal dust exposure in those with high levels of education [36].
In this study, the dataset was divided into three sets (training set, test set, and validation set), and three models (logistic, random forest, and convolutional neural network) were established to evaluate the risk of COPD in coal workers. The performance of the models was evaluated in terms of discrimination and calibration. The results show that the random forest model has the best performance, with a sensitivity of 81.86% (test set) and a specificity of 87.06% (test set), and is therefore more suitable for the risk assessment of COPD in coal workers. The random forest model is an improvement on the decision tree model that is widely used in the medical field, and it outperforms other models in some studies [37,38]. In this study, the CNN is better than the logistic model but not as good as the random forest model. In one study, Sandeep Bodduluri used machine learning algorithms to distinguish between the structural phenotypes of COPD; the AUCs of the CNN and random forest models were 0.80 and 0.78, respectively, so the CNN performed better [39]. The CNN performs differently in different studies, which may be related to the type of data. The CNN achieves better results in image recognition, and its effectiveness in other areas varies depending on the data. The logistic model performs the worst in this study, indicating that its predicted values deviate significantly from the actual values and that it is not suitable for the risk assessment of COPD in coal workers. The importance ranking of the predictors of the random forest model indicates that chemical poison exposure, cumulative dust exposure, mask usage, smoking index, and ventilation and dust removal measures are important predictors, and it thereby points to the measures through which coal workers can achieve the greatest health benefits.
This study constructs a risk scoring system for COPD based on the importance ranking of the optimal model random forest predictor variables and evaluates the risk scoring system with an accuracy of 83.7% and an AUC of 0.827, indicating that the scoring system has good discriminatory ability. The establishment of the risk scoring system explores the application value of the model, which can calculate the individual risk score according to their health data, evaluate the risk of COPD of individual occurrences, and provide a reference basis for the health management of coal workers.
This study has some limitations. First, biomass fuels and air pollution are also important influencing factors of COPD [40]. However, owing to the design of the questionnaire and the collection of the samples, we lack data on these factors and were unable to include these two variables in the study. In addition, this is a cross-sectional study and is therefore inferior to prospective cohort studies in verifying causality. Because of data collection limitations, we did not include coal workers over 60 years of age, which may have led to selection bias. Follow-up studies can survey retired workers to assess the effect of age on coal workers' COPD. In this study, we did not stage COPD, taking into account the use of the model and the distribution of pulmonary function test data. Because COPD is not staged, it may be difficult to extrapolate the model owing to differences in the distribution of data. The contribution of our study is mainly to provide a risk assessment model for COPD in coal workers and to construct a risk scoring system based on that model. As the uptake of pulmonary function testing is low among coal workers in daily life, our risk scoring system can be used to assess the risk of COPD among coal workers from health check-up data without pulmonary function testing, and to make targeted recommendations based on the individual's circumstances, thereby protecting the health of coal workers. The innovation of this paper is threefold: first, our research uses data obtained from field surveys to explore the relevant influencing factors of the disease; second, we used these data to build a risk assessment model suitable for the study population; and finally, we achieved model visualization by building a risk scoring system, which increases the applicability of the model.
According to the conclusions of our COPD study of coal workers, the following measures can effectively reduce the occupational hazards of coal dust for coal workers. Ventilation and dust removal measures are important protective measures, so water injection into coal seams, the adoption of new dust prevention technologies, and ensuring the good functioning of ventilation systems in the workplace can help reduce coal workers' dust exposure. Carrying out health education for coal workers, strengthening workers' awareness of dust prevention and the use of masks, and encouraging workers to develop healthy lifestyle habits, such as quitting smoking and exercising, are all important measures.

Conclusions
In this study, the analysis of the relevant data of coal workers shows that age 30 years and above, male gender, personal history of respiratory diseases, smoking index 1 and above, cumulative dust exposure of 50 mg/m³·year and above, seniority ≥ 10 years, and exposure to chemical poisons are risk factors for COPD in coal workers (all p < 0.05). A bachelor's degree (junior college) and above, physical exercise, use of masks at least 3-4 days/week, and good ventilation and dust removal measures are protective factors for COPD among coal workers.
The random forest model is better than the CNN and logistic models in assessing COPD risk in coal workers. The COPD risk scoring system constructed based on the random forest model has good discriminatory ability.